33 results found.
Written
Treebank,
Language Type:
Monolingual
Languages:
Afrikaans Akkadian Amharic Ancient Greek Arabic Armenian Assyrian Bambara Basque Belarusian Bhojpuri Breton Bulgarian Buryat Cantonese Catalan Chinese Classical Chinese Coptic Croatian Czech Danish Dutch English Erzya Estonian Faroese Finnish French Galician German Gothic Greek Hebrew Hindi Hindi English Hungarian Indonesian Irish Italian Japanese Karelian Kazakh Komi Permyak Komi Zyrian Korean Kurmanji Latin Latvian Lithuanian Livvi Maltese Marathi Mbya Guarani Moksha Naija North Sami Norwegian Old Church Slavonic Old French Old Russian Persian Polish Portuguese Romanian Russian Sanskrit Scottish Gaelic Serbian Skolt Sami Slovak Slovenian Spanish Swedish Swedish Sign Language Swiss German Tagalog Tamil Telugu Thai Turkish Ukrainian Upper Sorbian Urdu Uyghur Vietnamese Warlpiri Welsh Wolof Yoruba
Availability:
Freely Available
License:
Various
Size:
25 million words Production Status:
Existing-updated
Use:
Parsing and Tagging
-
Paper title:Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection
-
Paper track:Written/oral presentation
-
Paper status:Accept Oral
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Joakim Nivre | Universal Dependencies | /N |
Documentation:
https://universaldependencies.org
Written
Corpus,
Language Type:
Multilingual
Languages:
Arabic Bulgarian Catalan Croatian Czech Danish Dutch English Estonian Filipino Finnish French German Greek Hebrew Hindi Hungarian Indonesian Italian Japanese Korean Latvian Lithuanian Malay Norwegian Persian Polish Portuguese Romanian Russian Serbian Simplified Chinese Slovak Slovenian Spanish Swedish Thai Traditional Chinese Turkish Ukrainian Vietnamese
Availability:
Freely Available
License:
CC-BY-SA
Size:
60 GByte Production Status:
Newly created-in progress
Use:
Language Modelling
-
Paper title:Wiki-40B: Multilingual Language Model Dataset
-
Paper track:Written/oral presentation
-
Paper status:Accept Oral
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Rami Al-Rfou | Wiki40B-LM | /N |
Documentation:
None
Written
Corpus,
Language Type:
Monolingual
Languages:
Afrikaans Albanian Arabic Armenian Bangla Basque Bosnian Breton Bulgarian Catalan Croatian Czech Danish Dutch English Esperanto Estonian Filipino Finnish French Galician Georgian German Greek Hebrew Hindi Hungarian Icelandic Indonesian Italian Japanese Kazakh Korean Latvian Lithuanian Macedonian Malay Malayalam Norwegian Persian Polish Portuguese Romanian Russian Serbian Sinhala Slovak Slovenian Spanish Swedish Tamil Telugu Thai Turkish Ukrainian Urdu Vietnamese pt_br ze_en ze_zh zh_cn zh_tw
Availability:
Freely Available
License:
<Not Specified>
Size:
22.10G tokens Production Status:
Existing-used
Use:
Machine Translation, SpeechToSpeech Translation
-
Paper title:word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs
-
Paper track:Written/oral presentation
-
Paper status:Accept Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Yo Joong Choe | OpenSubtitles2018 | /N |
Documentation:
Yes, on the website.
Written
Lexicon,
Language Type:
Monolingual
Languages:
Afrikaans Albanian Arabic Armenian Bangla Basque Bosnian Breton Bulgarian Catalan Croatian Czech Danish Dutch English Esperanto Estonian Filipino Finnish French Galician Georgian German Greek Hebrew Hindi Hungarian Icelandic Indonesian Italian Japanese Kazakh Korean Latvian Lithuanian Macedonian Malay Malayalam Norwegian Persian Polish Portuguese Romanian Russian Serbian Sinhala Slovak Slovenian Spanish Swedish Tamil Telugu Thai Turkish Ukrainian Urdu Vietnamese pt_br ze_en ze_zh zh_cn zh_tw
Availability:
Freely Available
License:
CreativeCommons Attribution 4.0 International
Size:
41 GByte Production Status:
Newly created-finished
Use:
Machine Translation, SpeechToSpeech Translation
-
Paper title:word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs
-
Paper track:Written/oral presentation
-
Paper status:Accept Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Yo Joong Choe | word2word | /N |
Documentation:
Yes, on the website.
Written
Corpus,
Language Type:
Monolingual
Languages:
Bulgarian Croatian Czech Danish Dutch English Estonian Finnish French German Greek Hungarian Icelandic Irish Italian Latvian Lithuanian Maltese Polish Portuguese Romanian Slovak Slovenian Spanish Swedish
Availability:
Freely Available
License:
CC-0
Size:
341856530 sentences Production Status:
Newly created-in progress
Use:
Machine Translation, SpeechToSpeech Translation
-
Paper title:ParaCrawl: Web-Scale Acquisition of Parallel Corpora
-
Paper track:Long/Resources and Evaluation
-
Paper status:Accept
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Philipp Koehn | ParaCrawl | /N |
Documentation:
None
Written
Corpus,
Language Type:
Multilingual
Languages:
Afrikaans Albanian Amharic Arabic Aragonese Armenian Assamese Azerbaijani Basque Belarusian Bengali Bosnian Breton Bulgarian Burmese Catalan Central Khmer Chinese Croatian Czech Danish Dutch Dzongkha English Esperanto Estonian Finnish French Gaelic Galician Georgian German Greek Gujarati Hausa Hebrew Hindi Hungarian Icelandic Igbo Indonesian Irish Italian Japanese Kannada Kazakh Kinyarwanda Korean Kurdish Kyrgyz Latvian Limburgan Lithuanian Macedonian Malagasy Malay Malayalam Maltese Marathi Mongolian Nepali Northern Sami Norwegian Norwegian Bokmål Norwegian Nynorsk Occitan Oriya Panjabi Pashto Persian Polish Portuguese Romanian Russian Serbian Serbo-Croatian Sinhala Slovak Slovenian Spanish Swedish Tajik Tamil Tatar Telugu Thai Turkish Turkmen Uighur Ukrainian Urdu Uzbek Vietnamese Walloon Welsh Western Frisian Xhosa Yiddish Yoruba Zulu
Availability:
Freely Available
License:
Size:
55 million sentences Production Status:
Existing-used
Use:
Machine Translation, SpeechToSpeech Translation
-
Paper title:Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation
-
Paper track:Long/Machine Translation
-
Paper status:Accept
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Biao Zhang | the open parallel corpus (OPUS) | /N |
Documentation:
None
Not Applicable
Contextualsed word embeddings,
Language Type:
Monolingual
Languages:
Ancient Arabic Basque Bokmål Bulgarian Catalan Chinese Church Croatian Czech Danish Dutch English Estonian Finnish French Galician German Greek Hebrew Hindi Hungarian Indonesian Irish Italian Japanese Korean Latin Latvian Norwegian Nynorsk Old Persian Polish Portuguese Romanian Russian Simplified Chinese Slavonic Slovak Slovene Spanish Swedish Turkish Ukrainian Urdu Uyghur Vietnamese
Availability:
Freely Available
License:
none
Size:
18.4 GByte Production Status:
Existing-used
Use:
Parsing and Tagging
-
Paper title:Treebank Embedding Vectors for Out-of-domain Dependency Parsing
-
Paper track:Short/Syntax: Tagging, Chunking and Parsing
-
Paper status:Accept
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Joachim Wagner | Elmo For Many Languages | /N |
Documentation:
https://www.aclweb.org/anthology/K18-2005/
Speech/Written
Corpus,
Language Type:
Multilingual
Languages:
Arabic English French German Greek Italian Portuguese Russian Spanish
Availability:
Freely Available
License:
CC BY-NC-ND 4.0
Size:
200 Production Status:
Newly created-finished
Use:
Corpus Creation/Annotation
-
Paper title:The Multilingual TEDx Corpus for Speech Recognition and Translation
-
Paper track:12.6 Speech and multimodal resources/Oral Presentation
-
Paper status:Accept
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Elizabeth Salesky | Multilingual TEDx (mTEDx) | /N |
Documentation:
None
Speech/Written
Corpus,
Language Type:
Multilingual
Languages:
Basque Belgian Dutch Croatian Czech Galician Greek Hungarian Portuguese Slovak Slovenian Spanish
Availability:
From Owner
License:
Size:
None Production Status:
Existing-used
Use:
Evaluation/Validation
-
Paper title:An Approach to Online Speaker Change Point Detection Using DNNs and WFSTs
-
Paper track:5.4 Speech and audio segmentation/Poster Presentation
-
Paper status:Accept - Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Lukas Mateju | COST278 database | /N |
Documentation:
None
Speech
Corpus,
Language Type:
Multilingual
Languages:
Cantonese English French German Gishu Greek Gujarati Hebrew Hindi Indonesian Japanese Korean Mandarin Persian Portuguese Runyankore Russian Spanish Turkish Vietnamese
Availability:
Freely Available
License:
OpenSource
Size:
22.8 GByte Production Status:
Newly created-in progress
Use:
Speech Recognition/Understanding
-
Paper title:Speaking rate, information density, and information rate in first-language and second-language speech
-
Paper track:1.10 Bilingual and L2 acquisition and processing/Oral Presentation
-
Paper status:Accept - Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Ann Bradlow | The ALLSSTAR Corpus | /N |
Documentation:
Documentation in English is available to the public (via the project website)




